Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning
Authors
Abstract
Second-order optimization methods, notably the D-KFAC (Distributed Kronecker-Factored Approximate Curvature) algorithms, have gained traction for accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms must compute and communicate a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this paper, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF construction tasks of different DNN layers to different workers. DP-KFAC not only retains the convergence property of existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and a low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55×-1.65×, the communication cost by 2.79×-3.15×, and the memory footprint by 1.14×-1.47× in each second-order update compared to state-of-the-art D-KFAC methods. Our code is available at https://github.com/lzhangbv/kfac_pytorch.
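For intuition, here is a minimal, hypothetical PyTorch sketch of the distributed-preconditioning idea described in the abstract: each worker constructs and inverts the KFs only for the layers assigned to it, so the KFs themselves are never sent over the network; only the (much smaller) preconditioned gradients are exchanged. The function name, the round-robin layer assignment, the damping value, and all tensor shapes are our assumptions for illustration, not the paper's actual implementation (see the linked repository for that).

```python
import torch
import torch.distributed as dist

def dp_kfac_precondition(grads, acts, grad_outs, damping=0.03):
    """Precondition per-layer gradients using locally constructed KFs.

    grads[l]     -- weight gradient of layer l, shape (out_l, in_l)
    acts[l]      -- layer inputs a_l, shape (batch, in_l)
    grad_outs[l] -- gradients w.r.t. pre-activations g_l, shape (batch, out_l)
    Assumes torch.distributed is initialized (dist.init_process_group).
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    result = [torch.zeros_like(g) for g in grads]
    for l, grad in enumerate(grads):
        if l % world != rank:  # round-robin layer-to-worker assignment (assumed)
            continue
        a, g = acts[l], grad_outs[l]
        # Kronecker factors built from local statistics only, so no KF
        # communication is needed.
        A = a.t() @ a / a.shape[0]  # A_l = E[a a^T]
        G = g.t() @ g / g.shape[0]  # G_l = E[g g^T]
        A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0], device=A.device))
        G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0], device=G.device))
        # Preconditioned gradient: (G_l + λI)^{-1} ∇W_l (A_l + λI)^{-1}
        result[l] = G_inv @ grad @ A_inv
    # Non-owners hold zeros, so a sum all-reduce acts as a broadcast of each
    # owner's preconditioned gradient; only these, not the KFs, are exchanged.
    for p in result:
        dist.all_reduce(p)
    return result
```

This sketch only illustrates why distributing KF construction across workers eliminates KF communication; the paper's actual assignment strategy and communication schedule are more refined.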
Similar resources
Adaptive dropout for training deep neural networks
Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g., by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize its hidden units by selectively setting activities to zero. This ‘adapt...
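As an illustration only, a small hypothetical PyTorch sketch of the adaptive (‘standout’) dropout idea described in this snippet: an overlaid network computes a per-unit keep probability, and units are stochastically zeroed with the complementary probability. The weight-sharing belief net and the alpha/beta scaling are our assumptions based on the description above, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class StandoutLayer(nn.Module):
    """Sketch of adaptive dropout: the keep probability of each hidden
    unit is computed by an overlaid network (here sharing the layer's
    own weights, rescaled by alpha and shifted by beta -- an assumption)."""
    def __init__(self, d_in, d_out, alpha=1.0, beta=0.0):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.alpha, self.beta = alpha, beta

    def forward(self, x):
        h = torch.relu(self.fc(x))
        p_keep = torch.sigmoid(self.alpha * self.fc(x) + self.beta)
        if self.training:
            mask = torch.bernoulli(p_keep)  # selectively zero unit activities
            return h * mask
        # At test time, scale by the expected keep probability instead.
        return h * p_keep

# Usage: layer = StandoutLayer(784, 256); y = layer(torch.randn(32, 784))
```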
Exploring Strategies for Training Deep Neural Networks
Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise u...
Distributed Newton Methods for Deep Neural Networks
Deep learning involves a difficult non-convex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this...
Scalable Bayesian Optimization Using Deep Neural Networks
Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically fit using Gaussian processes (GPs). However, since GPs sc...
A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets
Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NT...
Journal
Journal title: IEEE Transactions on Cloud Computing
Year: 2022
ISSN: 2168-7161, 2372-0018
DOI: https://doi.org/10.1109/tcc.2022.3205918